Microbial pathogen genomes - 
new strategies for identifying 
therapeutics and vaccine targets 

Douglas R. Smith 

Advances in high-throughput DNA-sequencing techniques have given us the 
unprecedented ability to rapidly determine the nucleotide sequences of entire 
bacterial genomes. The application of these methods to the genomes of microbial 
pathogens, combined with efficient analytical tools and genome-scale approaches 
for studying gene expression, is revolutionizing our approach to the selection of 
targets for drug screening and vaccine development. This is bringing new life to this 
important, but long-neglected, field of research. 



The decision, several years ago, by the US Department 
of Energy, the National Institutes of Health (NIH) and 
several international funding agencies to embark upon 
programs to map and sequence the human genome 
has led to a number of important technological 
advances that are beginning to have an impact in other 
areas of biology. Among these advances are the devel- 
opment of automated methods for the generation of 
large amounts of raw DNA-sequencing information, 
computer software for rapidly processing and analyz- 
ing primary sequence data, and techniques for the 
rapid assembly of shotgun sequencing reads, even from 
entire bacterial genomes. Efficient algorithms for simi- 
larity searching allow the rapid identification of pro- 
tein-encoding sequences that are homologous to other 
genes, the sequences of which are held in public and 
private databases; as of April 1996, approximately 
500 megabases (Mb) of nucleotide sequence were 
contained in GenBank, and approximately 200 000 
sequences were held in the SWISS-PROT/Genpept/ 
PIR database of non-redundant proteins. Combined 
with the wealth of biochemical information that 
is archived in public databases, it has become possible 
to describe rapidly the full repertoire of genes in a mi- 
crobial genome, and to predict many of the meta- 
bolic pathways that an organism may utilize. 

Progress in this field has been stimulated by the inter- 
ests of the biotechnology and pharmaceutical indus- 
tries in using genome-sequencing data as a basis for 
drug discovery. In turn, this has led to the develop- 
ment of proprietary databases containing genomic 
information, which provide the basis for in silico experi- 
ments to identify novel targets for drugs, and for 
laboratory experiments to identify genes that perform 
critical functions. This article summarizes some recent 
developments in this important area, focusing on bac- 
terial sequences, and provides examples to illustrate 
how genome-sequencing information from microbial 
pathogens can be used to select targets for vaccine and 
drug development. The overall process used to pro- 
ceed from sequence generation to target validation is 
illustrated in Fig. 1. 

Large-scale sequencing of bacterial genomes 

Many laboratories use automated sample-prepar- 
ation techniques and fluorescence-based gel readers 
[such as that produced by Applied Biosystems Inc. 
(ABI); Foster City, CA, USA] for the large-scale 
sequencing of bacterial genomes. These instruments 
have the advantage that they are efficient, and relatively 
easy to set up and operate. A few laboratories use com- 
puter-assisted multiplex sequencing to achieve the 
same end (Ref. 1). In multiplex sequencing, samples consisting 
of pools of up to 20 plasmids are processed through 
sample preparation and gel electrophoresis, and the 
resulting sequences are determined from electroblots 
of the gels by hybridization with radioactive or 
fluorescently labeled probes. This technique can be used to 
generate 40 films (or digitized images) from each 
sequencing gel. Although multiplex sequencing is 
efficient at producing large amounts of 'shotgun' data, it 
is more difficult to set up and operate in the labora- 
tory than is fluorescence-based gel sequencing, and it 
is not suited to directed-finishing strategies. ABI 
machines are used in the author's laboratory to 
generate primer-directed reads for finishing and gap closure. 

During the past year, a group at The Institute for 
Genomic Research (TIGR; Gaithersburg, MD, USA) 
reported the complete sequences of Haemophilus 
influenzae (1.8 Mb), a major cause of respiratory 
infections and meningitis, especially in children*, and of 
Mycoplasma genitalium (0.6 Mb), which causes 
urethritis*. Approximately 1.6 Mb of contiguous 
sequence from the 4.7 Mb Escherichia coli genome has 
been published*, and the sequencing of a further 2 Mb 
was reported at the 1995 Genome Sequencing and 
Analysis VII (GSA-VII) meeting*. The genome of 
Helicobacter pylori (1.7 Mb), the major cause of 
stomach ulcers, has been sequenced by Genome 
Therapeutics Corporation (GTC; Waltham, MA, 
USA) under a privately funded microbial-pathogen 
sequencing program. More than half (1.5 Mb) of the 
2.8 Mb genome of Mycobacterium leprae (the etiologic 
agent of leprosy) has also been sequenced by GTC 
and is available through GenBank, the GTC web 
site <http://www.cric.com>, and through MycDB 
<http://www.biochem.kth.se/MycDB.html>, which 
contains mycobacterial genome mapping and sequence 
information. 
Other microbial pathogens that are currently being 
sequenced include Neisseria gonorrhoeae (University of 
Oklahoma, Norman, OK, USA), Streptococcus pyogenes 
(University of Oklahoma), Treponema pallidum 
(University of Texas, Houston, TX, USA, and TIGR), 
Mycobacterium tuberculosis (GTC and the Sanger 
Centre, Hinxton, Cambridge, UK), and Staphylococcus 
aureus [GTC, and Human Genome Sciences (HGS; 
Rockville, MD, USA)]. 

In addition to these pathogens, the genomes of several 
archaebacteria and other non-pathogens are being 
sequenced. These include Methanococcus jannaschii 
(TIGR), Pyrococcus furiosus (University of Utah, Salt 
Lake City, UT, USA), Sulfolobus solfataricus (Dalhousie 
University, Halifax, Nova Scotia, Canada), and 
Pyrobaculum aerophilum (California Institute of 
Technology, Pasadena, CA, USA, and University of 
California, Los Angeles, CA, USA). The 1.7 Mb 
genome of the archaeon Methanobacterium 
thermoautotrophicum is near completion at GTC (Ref. 7). 
Approximately 2 Mb of the 4.1 Mb Bacillus subtilis 
genome has now been sequenced by a consortium of 
European and Japanese laboratories, and the project 
may be completed by the end of 1996 (Ref. 8). 
Approximately 1 Mb of sequence from the 
genome of the cyanobacterium Synechocystis 
sp. 6803 was recently published (Ref. 9). 

Within the next couple of years, therefore, we can 
expect an explosion of bacterial-genome sequence 
information from species representing a variety of 
phylogenetic lineages, including many pathogens. 

Pharmaceutical companies have shown considerable 
interest in using pathogen genomics to facilitate the 
development of vaccines and small-molecule thera- 
peutics. For example, researchers at Glaxo Wellcome 
have sequenced a substantial fraction of the H. pylori 
genome to assist in the process of drug discovery. Over 
the past year, GTC has formed two research alliances 
with pharmaceutical companies to take advantage of 
sequences from microbial pathogens: one with 
Astra AB, focusing on the development of new 
antibiotics and vaccines to treat H. pylori infection, and 
one with Schering-Plough (Union, NJ, USA), to 
develop broad-spectrum antibiotics and vaccines. 
Although the genomic route to drug discovery for 
bacterial pathogens is new and remains unproved, the 
basic paradigm (outlined below) of gene identification, 
followed by functional analysis and drug screening, is 
well established. Thus, it is likely that more companies 
will become involved, and that, in the future, 
additional research alliances between genomics 
companies and the pharmaceutical industry will 
materialize in this area. 

Figure 1 

Flow diagram illustrating the process by which a microbial genome sequence is 
analysed and the information is used to direct experiments and aid in target 
selection for therapeutics development. The individual steps are referred to throughout 
the text. In the case of vaccine candidates, gene products from selected targets are 
expressed and tested in animal models. 



From sequence to genes 

The first task when confronted with an entire bac- 
terial-genome sequence, is to identify all the genes. 
This can be accomplished using a variety of 
techniques, but the most successful approaches use a 
combination of reading-frame and codon-usage analysis, 
together with similarity searching, to identify putative 
genes with homology to previously described 
sequences. Commonly used tools include GeneMark (Ref. 10), 
GenomeBrowser (Ref. 11), BLAST (Ref. 12), and highly 
parallelized implementations of the Smith-Waterman 
alignment, such as BLAZE, or MPsrch (Ref. 13). In 
general, organism-specific codon usage is highly pre- 
dictive for bacterial genes, but its effective use depends 
on the existence of sufficient information to generate 
accurate codon-usage matrices. In some cases, subsets 
of genes within an organism will exhibit codon-usage 
patterns that deviate significantly from the norm (Ref. 14). 
Such genes are thought to represent evolutionarily 
recent acquisitions by phage transduction, conju- 
gation, or some other form of horizontal transfer from 
other organisms. If enough of these genes are present, 
codon-usage tables of genomic subsets can be 
constructed to identify them. Translational start sites can 
be identified by the occurrence of start codons that 
coincide with abrupt changes in codon usage, the in- 
itiation of homology to previously characterized 
genes, or the presence of Shine-Dalgarno sequences (Ref. 13). 
Automated analysis tools (such as GenomeBrowser; Ref. 11) 
that provide a graphical display of open reading frames 
(ORFs), codon usage, database homologies and other 
features, make the task of identifying bacterial genes 
and their relationships with each other in the genome 
relatively straightforward. With the increasing pace of 
bacterial-genome sequencing, there is an emerging 
need for second-generation tools that will automate 
most of the laborious annotation process. 
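
The reading-frame and codon-usage analysis described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used by GeneMark or GenomeBrowser: the function names (`find_orfs`, `codon_frequencies`, `codon_usage_score`), the set of start codons, the minimum ORF length and the log-odds scoring against a uniform codon distribution are all choices made for the example, and the sequences are toy data.

```python
from collections import Counter
from math import log

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ORFs running from a start codon to a stop codon.

    Returns (strand, start, end, orf_sequence) tuples; coordinates are 0-based
    positions on the strand that was scanned, and end is exclusive.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # first in-frame start codon after the last stop
                elif codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


def codon_frequencies(genes):
    """Build a codon-usage table (codon -> relative frequency) from known genes."""
    counts = Counter()
    for gene in genes:
        counts.update(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_usage_score(orf, freqs, floor=1e-4):
    """Mean log-odds of the ORF's codons relative to uniform usage (1/64).

    Codons absent from the table receive a small floor frequency (an assumption).
    """
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    return sum(log(freqs.get(c, floor) * 64) for c in codons) / len(codons)


# Toy data: one 42-codon ORF embedded in a short sequence.
toy_genome = "CC" + "ATG" + "GCT" * 40 + "TAA" + "GG"
reference = codon_frequencies(["ATG" + "GCT" * 10 + "TAA"])
for strand, start, end, orf in find_orfs(toy_genome):
    print(strand, start, end, round(codon_usage_score(orf, reference), 2))
```

ORFs whose score falls well below that of typical genes in the same genome would be candidates either for spurious reading frames or for the atypical, possibly horizontally transferred, genes mentioned above; separating the two requires additional evidence such as database homology.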

From genes to function 

The second phase in the analysis of bacterial 
genomes is to identify the function of as many genes 
as possible. Currently, sequence homology is the most 
powerful tool. A high degree of homology between 
the putative translation product of a newly identified 
gene and an enzyme whose function has been 
thoroughly studied in other organisms provides strong 
support for the function of that protein, especially if it 
is the only homolog in the genome under scrutiny. 
Other useful tools include programs that identify 
sequence motifs from databases such as PROSITE 
(Ref. 16), BLOCKS (Ref. 17), BEAUTY (Ref. 18) 
and ProDom (Ref. 19). If one is attempting to 
identify vaccine candidates, then examining highly 
expressed cell-surface proteins is relevant, so it is then 
useful to know whether a protein contains a secretion 
signal, even if nothing else is known about it. Although 
the tools described here are very good at identifying 
homologies, 25-40% of the genes in a bacterial 
genome typically fail to show significant similarity 
with known proteins. 
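
Motif searching of the kind mentioned above can be illustrated by translating a PROSITE-style pattern into a regular expression and scanning a predicted protein with it. The sketch below is an assumption-laden illustration rather than the implementation used by any of the cited programs: `prosite_to_regex` and `find_motif` are hypothetical helper names, only the basic pattern syntax is handled (residues, `x`, `[...]`, `{...}`, repeat counts and the `<` / `>` anchors), and the example protein is invented. The pattern shown is the ATP/GTP-binding site motif A (P-loop) as it is commonly quoted.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(protein, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(protein)]


p_loop = "[AG]-x(4)-G-K-[ST]"
toy_protein = "MSTGAPNSGKSTLL"  # invented sequence for illustration
print(prosite_to_regex(p_loop))
print(find_motif(toy_protein, p_loop))
```

A match to a short motif is only suggestive; as the text notes, the strongest functional assignments come from extended homology to well-characterized proteins.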

Once the set of similarity-searching tools has been 
exhausted, one must return to molecular biology to 
further elucidate the function and expression pattern 
of predicted genes. Commonly used approaches to 
identifying essential genes in an organism include: the 
use of gene knockouts, disruptions using transposon- 
mediated mutagenesis, or homologous recombination 
with disrupted gene-constructs that contain an anti- 
biotic-resistance cassette. Gene disruptions can be 
generated in a variety of ways, including sophisticated 
'hit-and-run' approaches that interrupt a gene with- 
out introducing polar effects into downstream ORFs 
(Ref. 20). However, a gene-by-gene approach to the 
study of a whole genome is certainly time consuming 
and labor intensive. 

The availability of large amounts of genome- 
sequence information has stimulated the development 
of new approaches to functional analysis on a genomic 
scale. This has been particularly true for researchers 
investigating yeast, where a concerted effort is being 
made to ascertain the function of every ORF in the 
genome. Such strategies include the conceptually 
simple, but technologically advanced, technique of 
making microarrays of polymerase chain reaction 
(PCR)-amplified gene sequences on glass slides to 
allow the fluorescence-based detection of quantitative 
hybridization signals from labeled cDNA probes on 
large numbers of genes simultaneously - perhaps even 
all the genes of an organism (Ref. 21). An ingenious 
PCR-based approach to efficient sequence-signature-based 
expression analysis has recently been demonstrated (Ref. 22). 
In another example, a technique termed 'genetic 
footprinting' promises to replace individual gene 
knockouts by a global transposon-mutagenesis approach (Ref. 23). 
Insertions are induced en masse in a strain of interest, 
the strain is grown under a variety of conditions, 
and PCR products are analysed to identify genes in 
which transposon hops are under-represented because 
the genes are required for growth (Ref. 23). A conceptually 
similar dropout technique, which uses tagged trans- 
posons to identify the Salmonella typhimurium genes 
required for virulence in a mouse model, has been 
described (Ref. 24). 
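
The logic behind the global transposon approach described above can be sketched as a comparison of insertion densities before and after growth under selection. The code below is a simplified illustration under stated assumptions: the published method reads insertion sites from PCR products, whereas this sketch assumes that insertion positions are already available as genome coordinates; the function names, the gene names, the coordinates and the 0.2 depletion threshold are all hypothetical.

```python
def insertion_density(genes, insertions):
    """Insertions per kilobase for each gene.

    genes: dict mapping gene name -> (start, end), end exclusive.
    insertions: iterable of insertion coordinates.
    """
    positions = list(insertions)
    density = {}
    for name, (start, end) in genes.items():
        hits = sum(1 for pos in positions if start <= pos < end)
        density[name] = 1000.0 * hits / (end - start)
    return density


def candidate_essential_genes(genes, before, after, max_ratio=0.2):
    """Genes whose insertion density after selection drops to at most
    max_ratio of the density observed before selection."""
    d_before = insertion_density(genes, before)
    d_after = insertion_density(genes, after)
    return sorted(
        name for name in genes
        if d_before[name] > 0 and d_after[name] <= max_ratio * d_before[name]
    )


# Hypothetical coordinates and insertion sites.
toy_genes = {"geneA": (0, 1000), "geneB": (1000, 2000)}
before_growth = [100, 400, 700, 1200, 1500, 1800]
after_growth = [1200, 1250, 1500, 1800]
print(candidate_essential_genes(toy_genes, before_growth, after_growth))
```

Repeating the comparison for each growth condition and intersecting the resulting gene lists would give candidates that appear to be required under every condition tested, subject to the usual caveats about polar effects and insertion-site bias.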

Techniques that probe subsets of genes for a specific 
functionality, such as secretion or induction during 
growth in the host, have also been described. These 
techniques provide clones from which signature 
sequences can be derived, so that corresponding 
genes can be identified by comparing them with 
the genomic sequence. The IVET (in vivo expression 
technology) technique, which detects gene fusions 
that result in the in vivo selectable expression of a 
defective purA gene or antibiotic-resistance marker, 
has been used to identify Salmonella genes, the expres- 
sion of which is induced when the pathogen is grown 
in mice (Ref. 23). Finally, protein microsequencing (Ref. 26) and 
mass-spectrometry-based peptide analysis (Ref. 27) have 
been used to identify protein components (e.g. outer- 
membrane proteins) in partially purified mixtures, 
or to identify specific proteins separated by two- 
dimensional gel electrophoresis. Sequences generated 
in this manner can be used to correlate specific pro- 
teins with the gene sequences from which they are 
expressed. 

Target selection and validation 

The techniques described in the previous section 
can be used to identify genes in specific functional 
categories that may represent good targets for drug or 
vaccine development. In general, when developing 
new antibiotics, one is interested in genes that are 
essential under all growth conditions (and preferably 
even in quiescent cells), and for which inhibitors with 
useful chemical properties, such as permeability and 
low toxicity, can be identified. One advantage of 
having the entire sequence of a genome is that targets 
can be prioritized in terms of their activities and the 
properties of compounds that are known to interact 
with them. Even with the results of knockout or in 
vivo expression experiments, additional biological 
information can aid in narrowing down the field of 
choices. For example, genes can be selected on the 
basis of their probable roles in intracellular metab- 
olism. Databases, such as EcoCyc (Ref. 28) or PUMA 
(Ref. 29), that describe known metabolic pathways 
can be helpful in this regard. Detailed structural 
information about homologs of identified genes 
(determined using the Protein Data Bank; Ref. 30) can be used to 
assist in the molecular modeling of inhibitors (some 
resources for molecular modeling can be found at 
Ref. 31). 

As more genomes are sequenced, it will become 
possible to identify genes that are unique to a par- 
ticular organism or group of organisms, or genes 
that are conserved in certain groups. Thus, for 
example, it will be possible to use electronic com- 
parison to identify genes that are present in H. pylori 
but not in other gut-dwelling bacteria such as E. coli, 
providing a basis for the development of antibiotics 
specific to H. pylori. Although combinatorial chem- 
istries promise to speed up our ability to synthesize 
and screen large numbers of unique chemical entities, 
the sequence-based approach described here provides 
an avenue for the rational identification and selection 
of key targets for therapeutics development. Ulti- 
mate validation of the targets will, of course, require 
additional experiments such as protein expression, 
biochemical-assay development and animal 
studies to identify those with the most useful 
properties or inhibitors. 
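
The electronic comparison of genomes described above can be sketched as a filter on similarity-search results. The following is a minimal illustration, assuming that each predicted protein of the organism of interest has already been searched against the proteins of a second organism and that the best hit has been summarized as percent identity and alignment coverage. The function name, the gene identifiers, the summary format and both thresholds are hypothetical.

```python
def organism_specific_genes(query_genes, best_hits, min_identity=30.0, min_coverage=0.5):
    """Return query genes with no convincing homolog in the comparison genome.

    best_hits maps gene name -> {"identity": percent, "coverage": fraction}
    for the best-scoring match; genes without any hit are absent from the dict.
    """
    specific = []
    for gene in query_genes:
        hit = best_hits.get(gene)
        if (hit is None
                or hit["identity"] < min_identity
                or hit["coverage"] < min_coverage):
            specific.append(gene)
    return specific


# Hypothetical search summary for three predicted proteins.
toy_hits = {
    "gene_1": {"identity": 62.0, "coverage": 0.95},  # clear homolog
    "gene_2": {"identity": 24.0, "coverage": 0.90},  # weak similarity only
}
print(organism_specific_genes(["gene_1", "gene_2", "gene_3"], toy_hits))
```

Genes that pass such a filter are only candidates for organism-specific targets; the thresholds are arbitrary, and divergent homologs can be missed by pairwise comparison.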